2024-08-29
Please read these instructions before writing.
Regarding Question 4, Professor Love is the large fellow standing in the front of the room.
The key distinction we’ll make is between quantitative and qualitative (categorical) variables.
Information that is quantitative describes a quantity.
Continuous variables (can take any value in a range) vs. Discrete variables (limited set of potential values)
We can also distinguish interval variables (equal distance between values, but the zero point is arbitrary) from ratio variables (meaningful zero point).
Qualitative variables consist of names of categories.
Over 10 years, 547 people took (essentially) the same survey in the same way.
| Fall | 2023 | 2022 | 2021 | 2020 | 2019 |
|---|---|---|---|---|---|
| n | 53 | 54 | 58 | 67 | 61 |
| Fall | 2018 | 2017 | 2016 | 2015 | 2014 | Total |
|---|---|---|---|---|---|---|
| n | 51 | 48 | 64 | 49 | 42 | 547 |
About how many of those 547 surveys caused no problems in recording responses?
| # | Topic | # | Topic |
|---|---|---|---|
| Q1 | glasses | Q9 | lectures_vs_activities |
| Q2 | english | Q10 | projects_alone |
| Q3 | stats_so_far | Q11 | height |
| Q4 | guess_TL_ht | Q12 | hand_span |
| Q5 | smoke | Q13 | color |
| Q6 | handedness | Q14 | sleep |
| Q7 | stats_future | Q15 | pulse_rate |
| Q8 | haircut | - | - |
sex rather than glasses.

About how many of those 547 surveys caused no problems in recording responses?
What should we do in these cases?
Pulse responses, sorted (n = 61, 1 NA):

33 46 48 56 60 60
62 63 65 65 66 66
68 68 68 69 70 70
70 70 70 70 70 70
71 72 72 74 74 74
74 74 75 76 76 76
78 78 78 80 80 80
80 81 82 84 84 85
86 86 88 90 90 90
90 94 96 104 104 110

Stem-and-leaf display of the same values:

3 | 3
4 | 68
5 | 6
6 | 002355668889
7 | 00000000122444445666888
8 | 000012445668
9 | 000046
10 | 44
11 | 0
(Thanks, John Tukey)
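A stem-and-leaf display like the one above can be drawn in base R with `stem()`. The sketch below re-enters the 60 non-missing pulse values from the sorted list; the `scale` argument is only a tuning knob, and the stems R chooses may not match the slide exactly.

```r
# Pulse responses copied from the sorted list above
# (60 values; the single NA response is left out)
pulse <- c(33, 46, 48, 56, 60, 60,
           62, 63, 65, 65, 66, 66,
           68, 68, 68, 69, 70, 70,
           70, 70, 70, 70, 70, 70,
           71, 72, 72, 74, 74, 74,
           74, 74, 75, 76, 76, 76,
           78, 78, 78, 80, 80, 80,
           80, 81, 82, 84, 84, 85,
           86, 86, 88, 90, 90, 90,
           90, 94, 96, 104, 104, 110)

# Draw the display; increase scale to spread the values over more stems
stem(pulse, scale = 2)

# Count the leaves on the "7" stem (values from 70 to 79)
n_seventies <- sum(pulse >= 70 & pulse < 80)
n_seventies
```

Counting the values that fall on one stem is a quick check that the data were entered correctly.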
| Group | Within 2 | Within 5 | Too Low | Correct | Too High | Beat AI |
|---|---|---|---|---|---|---|
| The Confident Interval | 3 | 5 | 1 | 1 | 8 | 6 |
| MAWC | 3 | 7 | 4 | 1 | 5 | 7 |
| The Renaissance Coders | 1 | 7 | 2 | 1 | 7 | 6 |
| R-rational | 2 | 6 | 5 | 1 | 4 | 7 |
| TVMB | 6 | 9 | 1 | 1 | 8 | 7 |
| Something Creative & Original | 2 | 5 | 5 | 1 | 4 | 5 |
| AI | 2 | 4 | 7 | 1 | 2 | – |
These six groups (and the AI at https://howolddoyoulook.com/) each guessed one age correctly. The other seven groups are shown on the next slide.
| Group | Within 2 | Within 5 | Too Low | Correct | Too High | Beat AI |
|---|---|---|---|---|---|---|
| Baked Split | 4 | 8 | 3 | 0 | 7 | 6 |
| Pineapple Pizza | 4 | 8 | 5 | 0 | 5 | 7 |
| CWRU Crew | 4 | 4 | 5 | 0 | 5 | 6 |
| Statasaurous rex | 2 | 6 | 3 | 0 | 7 | 5 |
| Tukey 60 | 3 | 8 | 4 | 0 | 6 | 5 |
| Beat the Curve | 0 | 5 | 4 | 0 | 6 | 5 |
| Stats Avengers | 4 | 6 | 3 | 0 | 7 | 5 |
| Group | Mean Error | SD (Errors) | Median Error | (Min, Max) Error |
|---|---|---|---|---|
| The Confident Interval | 5 | 7.3 | 3 | -8, 16 |
| MAWC | 1 | 7 | 1 | -11, 14 |
| The Renaissance Coders | 2.6 | 6.1 | 4 | -9, 13 |
| R-rational | 0.8 | 5.6 | -1 | -6, 10 |
| TVMB | 2 | 2.5 | 1 | -2, 7 |
| Something Creative and Original | -0.2 | 6.2 | -0.5 | -8, 8 |
| AI | -5 | 7.3 | -5.5 | -15, 6 |
| Group | Mean Error | SD (Errors) | Median Error | (Min, Max) Error |
|---|---|---|---|---|
| Baked Split | 3.4 | 5.7 | 2 | -3, 14 |
| Pineapple Pizza | 1 | 6.3 | 0.5 | -6, 16 |
| CWRU Crew | -2 | 7.3 | 0 | -13, 9 |
| Statasaurous rex | 1.8 | 5.7 | 4 | -10, 8 |
| Tukey 60 | 2.1 | 6.3 | 1 | -5, 16 |
| Beat the Curve | 2 | 7.5 | 3.5 | -8, 13 |
| Stats Avengers | 1.9 | 9.4 | 3 | -21, 15 |
| Group | Mean AE | Range (AE) | Median AE | RMSE |
|---|---|---|---|---|
| The Confident Interval | 6.6 | 0, 16 | 5.5 | 8.5 |
| MAWC | 5.2 | 0, 14 | 3.5 | 6.7 |
| The Renaissance Coders | 5.4 | 0, 13 | 4.5 | 6.3 |
| R-rational | 4.6 | 0, 10 | 4.5 | 5.3 |
| TVMB | 2.4 | 0, 7 | 1.5 | 3.1 |
| Something Creative and Original | 5.2 | 0, 8 | 5.5 | 5.9 |
| AI | 7 | 0, 15 | 6 | 8.5 |
| Group | Mean AE | Range (AE) | Median AE | RMSE |
|---|---|---|---|---|
| Baked Split | 4.4 | 1, 14 | 3 | 6.4 |
| Pineapple Pizza | 4.4 | 1, 16 | 3 | 6 |
| CWRU Crew | 5.8 | 1, 13 | 6 | 7.2 |
| Statasaurous rex | 5.2 | 2, 10 | 5 | 5.7 |
| Tukey 60 | 4.7 | 1, 16 | 4 | 6.4 |
| Beat the Curve | 6.6 | 3, 13 | 5.5 | 7.4 |
| Stats Avengers | 6.5 | 1, 21 | 4 | 9.1 |
| AI | 7 | 0, 15 | 6 | 8.5 |
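library(tidyverse)

# Read the 2024 photo age-guessing results
photos <-
  read_csv("c02/data/ten-photo-age-history-2024.csv",
           show_col_types = FALSE)

# Order the photo labels by card number
photos <- photos |>
  mutate(label = fct_reorder(label, card))

head(photos)

# A tibble: 6 × 13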
order card label age sex facing year mean_guess error abs_error sq_error
<dbl> <dbl> <fct> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 Chong 21 M R 2024 25.3 4.3 4.3 18.5
2 2 2 Arch… 64 F L 2024 56.8 -7.2 7.2 51.8
3 3 3 Mayf… 28 F L 2024 31.4 3.4 3.4 11.6
4 4 4 Love 14 M L 2024 15.1 1.1 1.1 1.21
5 5 5 McGi… 54 F R 2024 63.5 9.5 9.5 90.2
6 6 6 Chan… 74 M L 2024 72.9 -1.1 1.1 1.21
# ℹ 2 more variables: `detailed description` <chr>, jpeg <chr>
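The `error`, `abs_error` and `sq_error` columns above are consistent with error = mean guess minus actual age. As a minimal sketch, assuming the usual definitions of mean error, mean absolute error (AE) and root mean squared error (RMSE), the six rows shown can be summarized as follows. The group tables earlier summarize each group's ten guesses, not these six class means, so the numbers here are illustrative only.

```r
# Values copied from the six rows of the photos output above
age        <- c(21, 64, 28, 14, 54, 74)
mean_guess <- c(25.3, 56.8, 31.4, 15.1, 63.5, 72.9)

# Signed error: positive means the guess was too high
error     <- mean_guess - age
abs_error <- abs(error)
sq_error  <- error^2

# Summary measures across the six photos
mean_error <- mean(error)           # average signed error (bias)
mean_ae    <- mean(abs_error)       # mean absolute error
rmse       <- sqrt(mean(sq_error))  # root mean squared error

round(c(mean_error = mean_error, mean_ae = mean_ae, rmse = rmse), 2)
```

Signed errors can cancel each other out, which is why a small mean error can sit alongside a much larger mean AE or RMSE.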
.csv file

I’ve placed love-age-guesses-2022-2024.csv on our 431-data page. This includes guesses from 2022-2024.
age_guess Tibble

Clicking on RAW in the 431-data presentation takes us to a (long) URL that contains the raw data in this sheet.
I’ll read in the sheet’s data to a new tibble (a special kind of R data frame) called age_guess using the read_csv() function.
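As a rough sketch of that step, the block below reads a three-row stand-in for the sheet. The rows are copied from output shown in this section, the name `age_guess_demo` is hypothetical, and the real call would pass the raw URL (not reproduced here) instead of the inline text.

```r
library(readr)

# Three rows standing in for the real file; the header matches the
# column names shown in the tibble output in this section
csv_text <- "student,guess1,guess2,actual,year
S-2022-01,57,62,55.5,2022
S-2022-05,61,NA,55.5,2022
S-2024-01,45,53,57.5,2024"

# I() marks the string as literal data rather than a file path or URL
age_guess_demo <- read_csv(I(csv_text), show_col_types = FALSE)

age_guess_demo
```

By default `read_csv()` treats the text `NA` as a missing value, so the second student's `guess2` is read in as `NA` and not as a character string.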
age_guess tibble

What do we get?
# A tibble: 148 × 5
student guess1 guess2 actual year
<chr> <dbl> <dbl> <dbl> <dbl>
1 S-2022-01 57 62 55.5 2022
2 S-2022-02 53 53 55.5 2022
3 S-2022-03 50 50 55.5 2022
4 S-2022-04 48 56 55.5 2022
5 S-2022-05 61 NA 55.5 2022
6 S-2022-06 63 63 55.5 2022
7 S-2022-07 67 58 55.5 2022
8 S-2022-08 50 57 55.5 2022
9 S-2022-09 50 50 55.5 2022
10 S-2022-10 43 56 55.5 2022
# ℹ 138 more rows
How many first guesses in each year were less than 57.5?
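One way to answer a question like this is to count within each year. The sketch below uses only the twenty rows printed in this section (ten from 2022, ten from 2024), so its counts cover that subset and not the full 148 rows; the name `shown` is hypothetical.

```r
library(dplyr)

# First guesses from the rows printed in this section
shown <- tibble(
  year   = rep(c(2022, 2024), each = 10),
  guess1 = c(57, 53, 50, 48, 61, 63, 67, 50, 50, 43,   # 2022 rows
             45, 50, 65, 55, 65, 58, 58, 60, 56, 62)   # 2024 rows
)

# Count, within each year, how many first guesses fall below 57.5
below <- shown |>
  group_by(year) |>
  summarise(n = n(),
            n_below = sum(guess1 < 57.5))

below
```

Applied to the full `age_guess` tibble, the same pipeline would give the counts for all three years.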
What do the guess1 values look like?

Change theme, specify bin width rather than number of bins
Add a vertical line at 57.5 years to show my actual age.
ggplot(age_guess,
       aes(x = guess1)) +
  geom_histogram(binwidth = 2,
                 col = "white", fill = "blue") +
  # vertical reference line at the actual age (57.5 in 2024)
  geom_vline(xintercept = 57.5, col = "red") +
  theme_bw() +
  labs(
    x = "First Guess of Dr. Love's Age",
    y = "Fall 2022-2024 431 students",
    title = "Pretty wide range of guesses",
    subtitle = "Dr. Love's Actual Age = 55.5 in 2022, 57.5 in 2024")

Create three facets, for 2022, 2023 and 2024 guesses…
ggplot(age_guess,
       aes(x = guess1, fill = factor(year))) +
  geom_histogram(binwidth = 2, col = "white") +
  theme_bw() +
  facet_grid(year ~ .) +
  labs(
    x = "First Guess of Dr. Love's Age",
    y = "# of Students",
    title = "Distribution of guesses over the past three years",
    subtitle = "Dr. Love's Actual Age = 55.5 in 2022, 57.5 in 2024")

student guess1 guess2 year
Length:148 Min. :40.00 Min. :40.00 Min. :2022
Class :character 1st Qu.:50.00 1st Qu.:53.00 1st Qu.:2022
Mode :character Median :55.00 Median :56.00 Median :2023
Mean :54.65 Mean :56.08 Mean :2023
3rd Qu.:58.00 3rd Qu.:59.00 3rd Qu.:2024
Max. :72.00 Max. :70.00 Max. :2024
NA's :3
What does NA's : 3 mean in guess2?

Why is student not summarized any further?

# A tibble: 56 × 5
student guess1 guess2 actual year
<chr> <dbl> <dbl> <dbl> <dbl>
1 S-2024-01 45 53 57.5 2024
2 S-2024-02 50 55 57.5 2024
3 S-2024-03 65 62 57.5 2024
4 S-2024-04 55 60 57.5 2024
5 S-2024-05 65 67 57.5 2024
6 S-2024-06 58 56 57.5 2024
7 S-2024-07 58 56 57.5 2024
8 S-2024-08 60 55 57.5 2024
9 S-2024-09 56 53 57.5 2024
10 S-2024-10 62 58 57.5 2024
# ℹ 46 more rows
guess1
44 45 46 47 48 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
1 1 1 2 1 4 1 2 3 1 5 5 3 6 6 4 1 3 1 1 2 1 1
4 | 4
4 | 56778
5 | 00001223334
5 | 5555566666777888888999999
6 | 0000122234
6 | 5567
Variable | Mean | SD | IQR | 90% CI | Range | Skewness | Kurtosis | n | n_Missing
-------------------------------------------------------------------------------------------------------
guess1 | 56.25 | 5.39 | 6.75 | [55.01, 57.39] | [44.00, 67.00] | -0.31 | -0.25 | 56 | 0
guess2 | 57.61 | 4.55 | 5.75 | [56.62, 58.34] | [47.00, 67.00] | -0.07 | -0.06 | 56 | 0
describe_distribution(age_24 |> select(guess1, guess2),
                      centrality = "median", ci = 0.90,
                      range = FALSE, quartiles = TRUE)

Variable | Median | MAD | IQR | 90% CI | Quartiles | Skewness | Kurtosis | n | n_Missing
------------------------------------------------------------------------------------------------------
guess1 | 57.00 | 4.45 | 6.75 | [55.98, 58.00] | 53.00, 59.25 | -0.31 | -0.25 | 56 | 0
guess2 | 57.50 | 3.71 | 5.75 | [57.00, 59.00] | 55.00, 60.25 | -0.07 | -0.06 | 56 | 0
lovedist() function from the Love-431.R script

# A tibble: 1 × 10
n miss mean sd med mad min q25 q75 max
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 56 0 56.2 5.39 57 4.45 44 53 59.2 67
# A tibble: 1 × 10
n miss mean sd med mad min q25 q75 max
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 56 0 57.6 4.55 57.5 3.71 47 55 60.2 67
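The `lovedist()` code itself lives in the Love-431.R script and is not reproduced here. As a rough stand-in, the hypothetical `dist_sketch()` below computes the same ten summaries with base R, applied to the 56 first guesses rebuilt from the frequency table of `guess1` shown earlier. It should reproduce the `guess1` results above (mean 56.25, median 57, quartiles 53 and 59.25).

```r
# Rebuild the 56 first guesses for 2024 from the frequency table of guess1
values <- c(44, 45, 46, 47, 48, 50, 51, 52, 53, 54, 55, 56,
            57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67)
counts <- c(1, 1, 1, 2, 1, 4, 1, 2, 3, 1, 5, 5,
            3, 6, 6, 4, 1, 3, 1, 1, 2, 1, 1)
guess1 <- rep(values, times = counts)

# Hypothetical stand-in for lovedist(); NOT the code from Love-431.R
dist_sketch <- function(x) {
  data.frame(
    n    = sum(!is.na(x)),
    miss = sum(is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    med  = median(x, na.rm = TRUE),
    mad  = mad(x, na.rm = TRUE),   # scaled MAD (constant = 1.4826)
    min  = min(x, na.rm = TRUE),
    q25  = unname(quantile(x, 0.25, na.rm = TRUE)),
    q75  = unname(quantile(x, 0.75, na.rm = TRUE)),
    max  = max(x, na.rm = TRUE)
  )
}

res <- dist_sketch(guess1)
res
```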
Create new variable (change = guess2 - guess1)
What will this look like?
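A minimal sketch of that step with `mutate()`, using only the ten 2024 rows printed earlier. The name `age_24_shown` is hypothetical, and the summary describes this ten-row subset, not the whole class.

```r
library(dplyr)

# The ten 2024 rows printed earlier in this section
age_24_shown <- tibble(
  student = sprintf("S-2024-%02d", 1:10),
  guess1  = c(45, 50, 65, 55, 65, 58, 58, 60, 56, 62),
  guess2  = c(53, 55, 62, 60, 67, 56, 56, 55, 53, 58)
)

# Positive change = second guess higher than the first
age_24_shown <- age_24_shown |>
  mutate(change = guess2 - guess1)

change_summary <- age_24_shown |>
  summarise(mean_change = mean(change),
            n_raised    = sum(change > 0),
            n_lowered   = sum(change < 0))

change_summary
```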
Call:
lm(formula = guess2 ~ guess1, data = age_guess)
Coefficients:
(Intercept) guess1
22.6760 0.6118
lm filters to complete cases by default.

stan_glm
family: gaussian [identity]
formula: guess2 ~ guess1
observations: 145
predictors: 2
------
Median MAD_SD
(Intercept) 22.8 2.7
guess1 0.6 0.0
Auxiliary parameter(s):
Median MAD_SD
sigma 3.7 0.2
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
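The 145 observations reported above are the 148 rows minus the 3 with a missing `guess2`. The sketch below shows the same complete-case behavior for `lm()` on the ten 2022 rows printed earlier, one of which has `guess2 = NA`. It assumes R's default `na.action` setting (`na.omit`), and the name `rows_2022` is hypothetical.

```r
# The ten 2022 rows printed earlier; the fifth student has no second guess
rows_2022 <- data.frame(
  guess1 = c(57, 53, 50, 48, 61, 63, 67, 50, 50, 43),
  guess2 = c(62, 53, 50, 56, NA, 63, 58, 57, 50, 56)
)

# With the default na.action, lm() silently drops the incomplete row
fit <- lm(guess2 ~ guess1, data = rows_2022)

n_rows     <- nrow(rows_2022)                 # rows supplied
n_complete <- sum(complete.cases(rows_2022))  # rows with no NA
n_used     <- nobs(fit)                       # rows used in the fit

c(rows = n_rows, complete = n_complete, used = n_used)
```

Comparing `nobs()` with `nrow()` is a quick way to notice when a model has quietly dropped rows.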
ggplot(data = temp |> filter(year == "2024"), aes(x = guess1, y = guess2)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, col = "blue") +
  geom_abline(intercept = 0, slope = 1, col = "red") +
  geom_text(x = 47, y = 45, label = "y = x", col = "red") +
  labs(x = "First Guess of Love's Age",
       y = "Second Guess of Love's Age",
       title = "Student Guesses of Dr. Love's Age in 2024",
       subtitle = "Love's actual age = 57.5 in 2024") +
  theme_bw()

431 Class 02 | 2024-08-29 | https://thomaselove.github.io/431-2024/